Sufficient Dimensionality Reduction with Irrelevance Statistics
Authors
Amir Globerson, Gal Chechik, Naftali Tishby
Abstract
The problem of unsupervised dimensionality reduction of stochastic variables while preserving their most relevant characteristics is fundamental for the analysis of complex data. Unfortunately, this problem is ill defined since natural datasets inherently contain alternative underlying structures. In this paper we address this problem by extending the recently introduced "Sufficient Dimensionality Reduction" feature extraction method [7], to use "side information" about irrelevant structures in the data. The use of such irrelevance information was recently successfully demonstrated in the context of clustering via the Information Bottleneck method [1]. Here we use this side-information framework to identify continuous features whose measurements are maximally informative for the main data set, but carry as little information as possible on the irrelevance data set. In statistical terms this can be understood as extracting statistics which are maximally sufficient for the main dataset, while simultaneously maximally ancillary for the irrelevance dataset. We formulate this problem as a tradeoff optimization problem and describe its analytic and algorithmic solutions. Our method is demonstrated on a synthetic example and on a real world application of face images, showing its superiority over other methods such as Oriented Principal Component Analysis.
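As a reading aid, and not a formula quoted from the paper itself, the tradeoff described in the abstract can be sketched as a single objective. Here φ denotes the extracted feature map, Y+ the variable of the main dataset, Y- the variable of the irrelevance dataset, and γ ≥ 0 a tradeoff parameter; all of this notation is illustrative:

\max_{\phi} \; \mathcal{L}(\phi) \;=\; I\bigl(\phi(X);\, Y^{+}\bigr) \;-\; \gamma\, I\bigl(\phi(X);\, Y^{-}\bigr)

Maximizing the first term drives φ(X) toward sufficiency for the main dataset, while the subtracted term drives it toward ancillarity for the irrelevance dataset; γ controls the balance between the two goals.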
Similar resources
Sufficient Dimensionality Reduction with Irrelevance Statistics
The problem of unsupervised dimensionality reduction of stochastic variables while preserving their most relevant characteristics is fundamental for the analysis of complex data. Unfortunately, this problem is ill defined since natural datasets inherently contain alternative underlying structures. In this paper we address this problem by extending the recently introduced “Sufficient Dimensional...
A sequential test for variable selection in high dimensional complex data
Given a high dimensional p-vector of continuous predictors X and a univariate response Y , principal fitted components (PFC) provide a sufficient reduction of X that retains all regression information about Y in X while reducing the dimensionality. The reduction is a set of linear combinations of all the p predictors, where with the use of a flexible set of basis functions, predictors related t...
A Monte Carlo-Based Search Strategy for Dimensionality Reduction in Performance Tuning Parameters
Redundant and irrelevant features in high dimensional data increase the complexity in underlying mathematical models. It is necessary to conduct pre-processing steps that search for the most relevant features in order to reduce the dimensionality of the data. This study made use of a meta-heuristic search approach which uses lightweight random simulations to balance between the exploitation of ...
Approximate Nearest Neighbor Regression in Very High Dimensions
Fast and approximate nearest-neighbor search methods have recently become popular for scaling nonparameteric regression to more complex and high-dimensional applications. As an alternative to fast nearest neighbor search, training data can also be incorporated online into appropriate sufficient statistics and adaptive data structures, such that approximate nearestneighbor predictions can be acc...
Additive Regression Splines With Irrelevant Categorical and Continuous Regressors
We consider the problem of estimating a relationship using semiparametric additive regression splines when there exist both continuous and categorical regressors, some of which are irrelevant but this is not known a priori. We show that choosing the spline degree, number of subintervals, and bandwidths via cross-validation can automatically remove irrelevant regressors, thereby delivering ‘auto...
Journal: CoRR
Volume: abs/1212.2483
Issue: -
Pages: -
Publication date: 2011